Addressing Missing Values
# Plot Percentage of missing values for each variable
# Create a data frame with missing value percentages
missing_values <- data.frame(
variable = names(data),
missing = colMeans(is.na(data)) * 100
)
# Convert variable to factor with reversed levels
missing_values$variable <- factor(missing_values$variable, levels = rev(missing_values$variable))
# Plot without ordering and display percentages on the bars
ggplot(missing_values, aes(x = variable, y = missing)) +
geom_bar(stat = "identity", fill = "#4AA2D0") +
geom_text(aes(label = sprintf("%.1f%%", missing)), hjust = 1, size = 4, color='#263238') +
coord_flip() +
labs(title = "Percentage of Missing Values by Variable",
x = "Variable",
y = "Percentage Missing") +
theme_minimal() +
theme(plot.title = element_text(size = 11))

We observe that the math1 variable has missing values
(approximately 43.1%), as well as other variables of interest. For now,
we only remove rows with missing values in the math1
variable. Other variables may still contain missing values, which will
be addressed in the final report before being incorporated into a
model/analysis.
# Remove rows with missing values in math1
data <- data %>% filter(!is.na(math1))
# Plot Percentage of missing values for each variable
# Create a data frame with missing value percentages
missing_values <- data.frame(
variable = names(data),
missing = colMeans(is.na(data)) * 100
)
# Convert variable to factor with reversed levels
missing_values$variable <- factor(missing_values$variable, levels = rev(missing_values$variable))
# Plot without ordering and display percentages on the bars
ggplot(missing_values, aes(x = variable, y = missing)) +
geom_bar(stat = "identity", fill = "#4AA2D0") +
geom_text(aes(label = sprintf("%.1f%%", missing)), hjust = -0.1, size = 4, color='#263238') +
coord_flip() +
labs(title = "Percentage of Missing Values by Variable (after removal of missing values in math1)",
x = "Variable",
y = "Percentage Missing") +
theme_minimal() +
theme(plot.title = element_text(size = 11))

After removing rows with missing values in the math1
variable, we are left with 6,600 student records.
Creating unique teacher IDs (will not be required with data from Harvard Dataverse)
# Attach unique teacher_id variable to the data based on combinations of star1, experience1, tethnicity1, schoolid1
data <- data %>% mutate(teacher_id = paste(star1, experience1, tethnicity1, schoolid1, sep = "_"))
# Check the number of unique teacher IDs
unique_teachers <- unique(data$teacher_id)
length(unique_teachers)
To address the lack of a given teacher/class ID in the dataset from
the AER package, we created a unique teacher_id variable
based on the combination of star1,
experience1, tethnicity1, and
schoolid1. There are 366 unique teacher IDs (number of
classes) in the dataset. This variable will be used to aggregate
students’ performance in the following steps.
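Because the composite ID assumes that no two teachers in the same school share a class type, experience level, and ethnicity, a quick sanity check is to look at the class size each ID implies. The sketch below uses a small made-up data frame (the values and the `max_plausible_size` threshold are illustrative assumptions, not taken from the STAR data) and base R only.

```r
# Toy data: hypothetical values, not taken from the STAR dataset
toy <- data.frame(
  star1       = c("small", "small", "regular", "regular", "regular"),
  experience1 = c(7, 7, 7, 7, 12),
  tethnicity1 = c("cauc", "cauc", "cauc", "cauc", "afam"),
  schoolid1   = c(63, 63, 63, 63, 20)
)

# Same composite key as in the report
toy$teacher_id <- paste(toy$star1, toy$experience1, toy$tethnicity1,
                        toy$schoolid1, sep = "_")

# Class size implied by each composite ID
class_sizes <- table(toy$teacher_id)

# Flag IDs whose implied class size exceeds a chosen threshold
# (max_plausible_size is an illustrative assumption)
max_plausible_size <- 30
suspect_ids <- names(class_sizes)[class_sizes > max_plausible_size]

print(class_sizes)
print(suspect_ids)
```

On the real data, any ID flagged this way would suggest that two or more classes were merged under one composite key.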
Creating new variable of interest: Teacher-Student Ethnicity Overlap Percentage
# Compute number of students by ethnicity and their respective percentage
student_ethnicities <- data %>%
group_by(ethnicity) %>%
summarise(
count = n(), # Number of students per ethnicity
percentage = (n() / nrow(data)) * 100 # Convert to percentage
) %>%
arrange(desc(count)) # Sort by count in descending order
# Display as a formatted table
student_ethnicities %>%
kable(caption = "Table 1: Number of Students by Ethnicity and Percentage") %>%
kable_styling(full_width = FALSE)
Table 1: Number of Students by Ethnicity and Percentage
| ethnicity | count | percentage |
|---|---|---|
| cauc | 4402 | 66.6969697 |
| afam | 2153 | 32.6212121 |
| asian | 19 | 0.2878788 |
| other | 11 | 0.1666667 |
| hispanic | 9 | 0.1363636 |
| amindian | 4 | 0.0606061 |
| NA | 2 | 0.0303030 |
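# Compute number of student records by teacher ethnicity and their respective percentage
# (n() counts student rows, so each teacher is weighted by class size)
teacher_ethnicities <- data %>%
group_by(tethnicity1) %>%
summarise(
count = n(), # Number of student records per teacher ethnicity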
percentage = (n() / nrow(data)) * 100 # Convert to percentage
) %>%
arrange(desc(count)) # Sort by count in descending order
# Display as a formatted table
teacher_ethnicities %>%
kable(caption = "Table 2: Number of Teachers by Ethnicity and Percentage") %>%
kable_styling(full_width = FALSE)
Table 2: Number of Teachers by Ethnicity and Percentage
| tethnicity1 | count | percentage |
|---|---|---|
| cauc | 5420 | 82.1212121 |
| afam | 1138 | 17.2424242 |
| NA | 42 | 0.6363636 |
We notice an imbalance in the proportions of ethnicities in both the
student and teacher populations: about 66.7% of students are Caucasian
and 32.6% are African American, with every other group below 0.3%. We
also note that there are 6 unique ethnicity values for students
(Caucasian, African American, Asian, Hispanic, Native American, and
other), while there are only 2 for teachers (Caucasian and African
American). The lack of variety in teacher ethnicities could be
attributed to the location and time period of the study, where the
teacher population may have been predominantly Caucasian and African
American.
Rather than taking the ethnicity or tethnicity1 variable as is (the
ethnicities of students and teachers are not themselves of primary
interest), we want to investigate whether a teacher and student sharing
the same ethnicity affects student performance for that class overall.
To answer this question, we create a new variable ethnicity_overlap
that measures the percentage of students in a class who share the
teacher's ethnicity. This variable is a first step toward quantifying
any biased or unbiased teaching practices based on shared
characteristics between teachers and students.
# First create a binary variable column indicating whether the student and teacher ethnicity match
data <- data %>%
mutate(
ethnicity = as.character(ethnicity),
tethnicity1 = as.character(tethnicity1),
ethnicity_match = ifelse(ethnicity == tethnicity1, 1, 0) # Create binary match variable
)
teacher_ethnicity_overlap <- data %>%
group_by(teacher_id) %>%
summarise(ethnicity_overlap = mean(ethnicity_match, na.rm = TRUE) * 100) # Convert to percentage
data <- data %>%
left_join(teacher_ethnicity_overlap, by = "teacher_id")
head(data, 10) %>%
dplyr::select(teacher_id, ethnicity_overlap, tethnicity1) %>%
kable(caption = "Table 3: Teacher-Student Ethnicity Overlap Percentage (first 10 rows)") %>%
kable_styling(full_width = FALSE)
Table 3: Teacher-Student Ethnicity Overlap Percentage (first 10 rows)
| teacher_id | ethnicity_overlap | tethnicity1 |
|---|---|---|
| small_7_cauc_63 | 86.66667 | cauc |
| small_32_afam_20 | 100.00000 | afam |
| regular+aide_8_cauc_5 | 95.65217 | cauc |
| regular_7_cauc_50 | 90.47619 | cauc |
| regular_11_cauc_69 | 100.00000 | cauc |
| small_15_cauc_79 | 95.00000 | cauc |
| regular_0_cauc_5 | 90.00000 | cauc |
| regular_5_cauc_16 | 0.00000 | cauc |
| regular_17_cauc_48 | 100.00000 | cauc |
| regular_1_afam_51 | 52.38095 | afam |
Further visualizations and analysis on this newly created variable
will be conducted in the following sections.
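To make the handling of missing ethnicities concrete, the following base R sketch recomputes the overlap on a small made-up data frame (the classes "A", "B", "C" and their values are hypothetical). It illustrates that a student with a missing ethnicity is simply dropped from the class average, while a class whose teacher ethnicity is missing ends up with no usable matches at all.

```r
# Toy data: hypothetical students in three classes
toy <- data.frame(
  teacher_id  = c("A", "A", "A", "B", "B", "C", "C"),
  ethnicity   = c("cauc", "afam", NA, "afam", "afam", "cauc", "afam"),
  tethnicity1 = c("cauc", "cauc", "cauc", "afam", "afam", NA, NA),
  stringsAsFactors = FALSE
)

# 1 if student and teacher ethnicity match, 0 if not, NA if either is missing
toy$ethnicity_match <- ifelse(toy$ethnicity == toy$tethnicity1, 1, 0)

# Percentage of matching students per class, ignoring missing matches
overlap <- tapply(toy$ethnicity_match, toy$teacher_id,
                  function(m) mean(m, na.rm = TRUE) * 100)
print(overlap)

# A class with only missing matches yields NaN, which is.na() also reports as missing
print(is.na(overlap))
```

Classes like "C" in this sketch carry a missing overlap value, so they are excluded from any plot or summary of ethnicity_overlap.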
Student Performance by Teacher and Appropriate Summary Measure
# Aggregate students' performance by teacher
teacher_performance <- data %>%
group_by(teacher_id) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE),
n = n())
# Display the first few rows of the aggregated data
head(teacher_performance, 10) %>%
kable(caption = "Table 4: Aggregated Student Performance by Teacher (first 10 rows)") %>%
kable_styling(full_width = FALSE)
Table 4: Aggregated Student Performance by Teacher (first 10 rows)
| teacher_id | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 | n |
|---|---|---|---|---|---|---|
| regular+aide_0_cauc_48 | 495.0000 | 487.0 | 34.52304 | 475.25 | 509.25 | 26 |
| regular+aide_0_cauc_77 | 548.5455 | 542.0 | 29.30907 | 529.75 | 578.00 | 22 |
| regular+aide_10_cauc_1 | 545.9259 | 545.0 | 49.54402 | 501.00 | 584.00 | 27 |
| regular+aide_10_cauc_32 | 497.9130 | 490.0 | 26.40405 | 482.50 | 516.50 | 23 |
| regular+aide_10_cauc_67 | 536.5500 | 542.0 | 33.15272 | 512.00 | 554.00 | 20 |
| regular+aide_10_cauc_9 | 544.7273 | 547.0 | 34.10095 | 535.00 | 560.75 | 22 |
| regular+aide_11_afam_49 | 517.6538 | 518.0 | 32.55757 | 495.00 | 535.00 | 26 |
| regular+aide_11_cauc_13 | 551.3500 | 545.5 | 39.88177 | 528.25 | 578.00 | 20 |
| regular+aide_11_cauc_36 | 533.1364 | 524.5 | 35.25053 | 507.00 | 555.00 | 22 |
| regular+aide_11_cauc_37 | 535.9500 | 519.0 | 58.57831 | 491.75 | 572.00 | 20 |
# Distribution of the mean math scores by teacher
p1 <- ggplot(teacher_performance, aes(x = mean_math1)) +
geom_histogram(fill = "#4AA2D0", color = "#263238", bins = 30) +
labs(title = "Distribution of Mean Math Scores by Teacher/Class",
x = "Mean Math Score",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(size = 10))
# Distribution of the median math scores by teacher
p2 <- ggplot(teacher_performance, aes(x = median_math1)) +
geom_histogram(fill = "#2B5798", color = "#263238", bins = 30) +
labs(title = "Distribution of Median Math Scores by Teacher/Class",
x = "Median Math Score",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(size = 10))
# Display the two plots side by side
p1 + plot_spacer() + p2 + plot_layout(widths = c(1, 0.3, 1))

# Quantile-quantile plot of the mean math scores by teacher
qqPlot(teacher_performance$mean_math1, distribution = "norm", main = "Q-Q Plot of Mean Math Scores by Teacher")

## [1] 171 20
We observe that the distribution of mean student math scores by
teacher is approximately normal, with no visually prominent outliers
(scores are bounded by fixed minimum and maximum values). The
quantile-quantile plot supports this: the points lie close to the
reference line. The mean is therefore a more suitable summary measure
than the median here, since the distribution of class means looks closer
to normal, whereas the class medians vary more from bin to bin.
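A formal test can complement the visual check of normality. The sketch below applies the Shapiro-Wilk test from base R to simulated class means; `sim_means` is a hypothetical stand-in for `teacher_performance$mean_math1`, and its mean and standard deviation are assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated class means: hypothetical stand-in for teacher_performance$mean_math1
sim_means <- rnorm(300, mean = 530, sd = 25)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
sw <- shapiro.test(sim_means)
print(sw)
```

A small p-value would be evidence against normality. With several hundred classes the test can flag minor departures, so it is best read alongside the Q-Q plot rather than instead of it.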
Univariate Descriptive Statistics of Selected Variables
This section provides visual and tabular summaries of the selected
variables of interest.
STAR Class Type (star1)
class_types_count <- data %>%
group_by(star1) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(class_types_count, aes(x = reorder(star1, count), y = count)) +
geom_bar(stat = "identity", fill = "#00D29A") +
geom_text(aes(label = count), hjust = 1.5, size = 4, color='#263238') + # Adjust text position
coord_flip() + # Rotate chart
labs(title = "Number of Students by STAR Class Type",
x = "STAR Class Type", # labs() follows the aesthetic, so x is the category even after coord_flip()
y = "Number of Students") +
theme_minimal()

While not perfectly balanced, the distribution of students across
class types is relatively even. The largest group of students is in
regular classes, followed by regular classes with an aide, and then
small classes. The overall and per-class-type sample sizes are worth
noting for later analysis, such as testing for homogeneity of variance
in ANOVA.
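One way to test homogeneity of variance across groups of unequal size is Bartlett's test, which is available in base R. The sketch below runs it on simulated scores; the group sizes, means, and standard deviations are made-up assumptions, not values estimated from the STAR data.

```r
set.seed(2)

# Simulated scores for three class types with unequal group sizes (hypothetical)
toy <- data.frame(
  star1 = factor(rep(c("regular", "small", "regular+aide"), times = c(40, 30, 35))),
  math1 = c(rnorm(40, 525, 42), rnorm(30, 539, 44), rnorm(35, 530, 43))
)

# Bartlett's test: the null hypothesis is that all groups share the same variance
bt <- bartlett.test(math1 ~ star1, data = toy)
print(bt)
```

Bartlett's test is sensitive to departures from normality. Levene's test (for example `leveneTest()` in the car package) is a common, more robust alternative.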
School ID (schoolid1)
# Bar plot of teacher count per school
teacher_count_per_school <- data %>%
group_by(schoolid1) %>%
summarise(teacher_count = n_distinct(teacher_id)) # Count distinct teachers, not student rows
ggplot(teacher_count_per_school, aes(x = factor(schoolid1), y = teacher_count)) +
geom_bar(stat = "identity", fill = "#00D29A", color = "#263238") +
labs(title = "Number of Teachers per School",
x = "School ID",
y = "Teacher Count") +
theme_minimal()

The distribution of teachers across schools is relatively balanced,
with a few schools having more teachers than others. Overall, this means
that we have a good spread of teachers across schools, which is
important for generalizability of results.
Ethnicity Overlap (ethnicity_overlap)
# Compute summary statistics for ethnicity_overlap
ethnicity_overlap_summary <- data %>%
summarise(
min = min(ethnicity_overlap, na.rm = TRUE),
q1 = quantile(ethnicity_overlap, 0.25, na.rm = TRUE),
median = median(ethnicity_overlap, na.rm = TRUE),
mean = mean(ethnicity_overlap, na.rm = TRUE),
q3 = quantile(ethnicity_overlap, 0.75, na.rm = TRUE),
max = max(ethnicity_overlap, na.rm = TRUE),
sd = sd(ethnicity_overlap, na.rm = TRUE),
missing = sum(is.na(ethnicity_overlap)) # Count missing overlap values (not experience1)
)
# Display the summary statistics in a formatted table
ethnicity_overlap_summary %>%
kable(caption = "Table 5: Summary Statistics for Ethnicity Overlap") %>%
kable_styling(full_width = FALSE)
Table 5: Summary Statistics for Ethnicity Overlap
| min | q1 | median | mean | q3 | max | sd | missing |
|---|---|---|---|---|---|---|---|
| 0 | 71.42857 | 93.33333 | 78.15416 | 100 | 100 | 31.9979 | 42 |
# Boxplot
p11 <- ggplot(data, aes(y = ethnicity_overlap)) +
geom_boxplot(fill = "#00D29A", color = "#263238", outlier.color = "red", outlier.shape = 16) +
labs(title = "Boxplot of Teacher-Student Ethnicity Overlap",
y = "Ethnicity Overlap (%)") +
theme_minimal() +
theme(plot.title = element_text(size = 10))
# Histogram
p12 <- ggplot(data, aes(x = ethnicity_overlap)) +
geom_histogram(fill = "#00D29A", color = "#263238", bins = 30) +
labs(title = "Distribution of Teacher-Student Ethnicity Overlap",
x = "Ethnicity Overlap (%)",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(size = 10))
p11 | p12
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_bin()`).

For the newly created variable ethnicity_overlap, we observe that the
distribution is heavily left-skewed (median 93.3%, mean 78.2%), with a
large mass at 100% overlap and a noticeable number of observations at 0%
overlap. Because these summaries are computed on student-level rows,
larger classes carry more weight. This variable will require a deeper
investigation when analyzing its relationship with student performance,
but it may lead to interesting insights on the impact of shared
characteristics between teachers and students.
Economic Status (lunch1)
# bar plot of lunch1
lunch_count <- data %>%
group_by(lunch1) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(lunch_count, aes(x = reorder(lunch1, count), y = count)) +
geom_bar(stat = "identity", fill = "#00D29A") +
geom_text(aes(label = count), hjust = 1.1, size = 4, color='#263238') + # Adjust text position
coord_flip() + # Rotate chart
labs(title = "Number of Students by Economic Status",
x = "Economic Status", # labs() follows the aesthetic, so x is the category even after coord_flip()
y = "Number of Students") +
theme_minimal()

While there are no direct indicators of students’ family economic
status, we can utilize the lunch1 variable as a proxy. The
distribution of students across economic status is relatively balanced,
with a very slight majority of students receiving free lunch. This
variable will be important to consider when analyzing the relationship
between economic status and student performance.
Teacher’s Highest Education Level (degree1)
# Bar plot of degree1 (one count per teacher, not per student row)
degree_count <- data %>%
group_by(degree1) %>%
summarise(count = n_distinct(teacher_id)) %>%
arrange(desc(count))
ggplot(degree_count, aes(x = reorder(degree1, count), y = count)) +
geom_bar(stat = "identity", fill = "#00D29A") +
geom_text(aes(label = count), hjust = 0.2, size = 4, color='#263238') + # Adjust text position
coord_flip() + # Rotate chart
labs(title = "Number of Teachers by Highest Education Level",
x = "Highest Education Level", # labs() follows the aesthetic, so x is the category even after coord_flip()
y = "Number of Teachers") +
theme_minimal()

We notice that the majority of teachers have a bachelor’s degree,
followed by a master’s degree. The lack of teachers in other education
levels may limit the capacity to analyze the impact of higher education
levels on student performance at a more granular level. However, this
variable will still be able to provide insight into whether or not a
teacher’s graduate degree makes a significant difference in student
performance.
Teacher’s Teaching Experience (experience1)
# Table of summary statistics for experience1 including missing values
experience_summary <- data %>%
summarise(
min = min(experience1, na.rm = TRUE),
q1 = quantile(experience1, 0.25, na.rm = TRUE),
median = median(experience1, na.rm = TRUE),
mean = mean(experience1, na.rm = TRUE),
q3 = quantile(experience1, 0.75, na.rm = TRUE),
max = max(experience1, na.rm = TRUE),
sd = sd(experience1, na.rm = TRUE),
missing = sum(is.na(experience1))
)
kable(experience_summary, caption = "Table 6: Summary Statistics for Teaching Experience") %>%
kable_styling(full_width = FALSE)
Table 6: Summary Statistics for Teaching Experience
| min | q1 | median | mean | q3 | max | sd | missing |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 10 | 11.62508 | 17 | 42 | 8.922621 | 12 |
# Boxplot of experience1
p9 <- ggplot(data, aes(y = experience1)) +
geom_boxplot(fill = "#00D29A", color = "#263238", outlier.color = "red", outlier.shape = 16) +
labs(title = "Boxplot of Teacher's Teaching Experience",
y = "Teaching Experience") +
theme_minimal() +
theme(plot.title = element_text(size = 11))
# Histogram of experience1
p10 <- ggplot(data, aes(x = experience1)) +
geom_histogram(fill = "#00D29A", color = "#263238", bins = 30) +
labs(title = "Distribution of Teacher's Teaching Experience",
x = "Teaching Experience",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(size = 11))
p9 | p10
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

We note the presence of some outlier teachers with considerably
longer teaching experience than others (up to 42 years). Overall, the
distribution is slightly right-skewed (mean 11.6 vs. median 10), with
the middle 50% of observations falling between 4 and 17 years of
experience.
Teacher’s Career Ladder Level (ladder1)
# Bar plot of ladder1 (one count per teacher, not per student row)
ladder_count <- data %>%
group_by(ladder1) %>%
summarise(count = n_distinct(teacher_id)) %>%
arrange(desc(count))
ggplot(ladder_count, aes(x = reorder(ladder1, count), y = count)) +
geom_bar(stat = "identity", fill = "#00D29A") +
geom_text(aes(label = count), hjust = 0.2, size = 4, color='#263238') + # Adjust text position
coord_flip() + # Rotate chart
labs(title = "Number of Teachers by Career Ladder Level",
x = "Career Ladder Level", # labs() follows the aesthetic, so x is the category even after coord_flip()
y = "Number of Teachers") +
theme_minimal()

The majority of teachers are in the level 1 career ladder. Given the
lack of teachers in other career ladder levels, the teacher’s experience
level may be a more appropriate variable to analyze the impact of
teacher career progression on student performance.
Multivariate Descriptive Statistics of Selected Variables
Math Scores (math1) vs. STAR Class Type (star1)
class_types <- data %>%
group_by(star1) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
class_types %>%
kable(caption = "Table 7: Summary Statistics of Math Scores by STAR Class Type") %>%
kable_styling(full_width = FALSE)
Table 7: Summary Statistics of Math Scores by STAR Class Type
| star1 | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 |
|---|---|---|---|---|---|
| regular | 525.2744 | 523 | 41.65123 | 495.00 | 553 |
| small | 538.6777 | 535 | 44.10308 | 509.25 | 567 |
| regular+aide | 529.6252 | 529 | 42.86598 | 497.00 | 557 |
ggplot(data, aes(x = star1, y = math1)) +
geom_boxplot(fill = "#AFAAFF", color = "#263238") +
labs(title = "Math Scores by STAR Class Type",
x = "STAR Class Type",
y = "Math Score") +
theme_minimal()

The side-by-side boxplot of math scores by class type shows that
students have similar distributions of math scores across class types.
Students in small classes have a slightly higher mean (538.7), median
(535), and quartiles than those in regular (525.3, 523) or regular+aide
(529.6, 529) classes, but we cannot confirm at this stage that these
differences are significant. This motivates the use of an ANOVA test
followed by Tukey’s HSD test to determine whether student performance
differs significantly across class types.
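As a rough illustration of that workflow, the sketch below fits a one-way ANOVA and then runs Tukey's HSD using base R's `aov()` and `TukeyHSD()`. The data are simulated, with group means and standard deviations chosen as assumptions loosely resembling Table 7, so the output says nothing about the actual STAR results.

```r
set.seed(3)

# Simulated scores for three class types (hypothetical values)
toy <- data.frame(
  star1 = factor(rep(c("regular", "small", "regular+aide"), each = 50)),
  math1 = c(rnorm(50, 525, 42), rnorm(50, 539, 44), rnorm(50, 530, 43))
)

# One-way ANOVA: does mean math score differ by class type?
fit <- aov(math1 ~ star1, data = toy)
print(summary(fit))

# Tukey's HSD: pairwise differences with family-wise adjusted confidence intervals
tukey <- TukeyHSD(fit, which = "star1")
print(tukey)
```

In the report's setting the same calls could be applied to class-level means rather than student-level scores, depending on which unit of analysis is chosen.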
Math Scores (math1) vs. School ID (schoolid1)
# Summary statistics of math scores by schoolid1
schoolid_count <- data %>%
group_by(schoolid1) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
head(schoolid_count, 10) %>%
kable(caption = "Table 8: Summary Statistics of Math Scores by School ID (first 10 rows)") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
Table 8: Summary Statistics of Math Scores by School ID (first 10 rows)
| schoolid1 | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 |
|---|---|---|---|---|---|
| 1 | 536.8969 | 538 | 42.58753 | 505.00 | 567.00 |
| 2 | 493.7500 | 489 | 28.39822 | 474.00 | 515.00 |
| 3 | 558.8818 | 555 | 41.22410 | 535.00 | 582.50 |
| 4 | 524.2714 | 526 | 39.07923 | 500.00 | 548.00 |
| 5 | 517.1892 | 512 | 32.66756 | 497.00 | 534.25 |
| 7 | 552.6102 | 553 | 50.48619 | 529.00 | 590.00 |
| 8 | 526.5161 | 526 | 34.76649 | 502.00 | 549.00 |
| 9 | 551.2000 | 551 | 40.99771 | 526.75 | 582.50 |
| 10 | 567.8966 | 572 | 46.50097 | 542.75 | 601.00 |
| 11 | 550.9351 | 549 | 51.75406 | 518.00 | 578.00 |
# Boxplots of math scores by schoolid1
ggplot(data, aes(x = factor(schoolid1), y = math1)) +
geom_boxplot(fill = "#AFAAFF", color = "#263238") +
labs(title = "Math Scores by School ID",
x = "School ID",
y = "Math Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Scatterplot of math scores by schoolid1
ggplot(data, aes(x = factor(schoolid1), y = math1)) +
geom_jitter(color = "#004D33", alpha = 0.6, width = 0.1, size = 0.9) +
labs(title = "Scatterplot of Math Scores by School ID",
x = "School ID",
y = "Math Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

Scores vary widely within each school, and no single pattern holds
across schools. Some schools nonetheless have noticeably
higher-performing students than others (for example, a mean of 567.9 in
school 10 versus 493.8 in school 2), which warrants further
investigation to determine whether the differences across schools are
significant.
Math Scores (math1) vs. Ethnicity Overlap (ethnicity_overlap)
# Scatterplot of math scores against ethnicity overlap
ggplot(data, aes(x = ethnicity_overlap, y = math1)) +
geom_jitter(color = "#AFAAFF", alpha = 0.6, size = 1, width = 0.1, height = 0.1) + # Reduce overlap
labs(title = "Scatterplot of Math Scores vs. Ethnicity Overlap",
x = "Ethnicity Overlap",
y = "Math Score") +
theme_minimal()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Visually, there seems to be a wider range of math scores in classes with
either 0% or 100% overlap in teacher and student ethnicity, though this
may partly reflect the larger number of observations at those values.
This trend will be explored further in the inferential analysis to
determine whether any differences arise in classrooms where teachers and
students share ethnicities.
Math Scores (math1) vs. Economic Status (lunch1)
lunch_count <- data %>%
group_by(lunch1) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
lunch_count %>%
kable(caption = "Table 9: Summary Statistics of Math Scores by Economic Status") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
Table 9: Summary Statistics of Math Scores by Economic Status
| lunch1 | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 |
|---|---|---|---|---|---|
| non-free | 544.9734 | 545 | 41.54842 | 518 | 572 |
| free | 516.7250 | 515 | 39.98681 | 488 | 542 |
| NA | 528.0798 | 526 | 41.89183 | 502 | 553 |
ggplot(data, aes(x = lunch1, y = math1)) +
geom_boxplot(fill = "#AFAAFF", color = "#263238") +
labs(title = "Math Scores by Economic Status",
x = "Economic Status",
y = "Math Score") +
theme_minimal()

ggplot(data, aes(x = star1, y = math1, fill = lunch1)) +
geom_boxplot() +
labs(title = "Math Scores by STAR Class Type and Economic Status",
x = "STAR Class Type",
fill = "Economic Status",
y = "Math Score") +
theme_minimal() +
scale_fill_manual(values = c("#AFAAFF", "#4D3898"))

Overall, students who qualify for free lunch have lower mean, median,
and quartile math scores than students who do not (mean 516.7 vs. 545.0;
median 515 vs. 545), a gap that appears to hold within each class type.
This trend will be explored further.
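A two-sample comparison is one simple way to check whether the gap between lunch groups is larger than chance would explain. The sketch below uses Welch's t-test from base R on simulated scores; the group sizes, means, and standard deviations are assumptions for illustration only.

```r
set.seed(5)

# Simulated scores for the two lunch groups (hypothetical values)
toy <- data.frame(
  lunch1 = factor(rep(c("free", "non-free"), times = c(60, 60))),
  math1  = c(rnorm(60, 517, 40), rnorm(60, 545, 42))
)

# t.test() defaults to Welch's test, which does not assume equal variances
tt <- t.test(math1 ~ lunch1, data = toy)
print(tt)
```

With the formula interface, rows with a missing lunch1 or math1 value are dropped before the test, which matters here because lunch1 has missing entries in the real data.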
Math Scores (math1) vs. Teacher’s Degree Level (degree1)
degree_count <- data %>%
group_by(degree1) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
degree_count %>%
kable(caption = "Table 10: Summary Statistics of Math Scores by Degree Level") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
Table 10: Summary Statistics of Math Scores by Degree Level
| degree1 | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 |
|---|---|---|---|---|---|
| bachelor | 528.9805 | 526.0 | 42.53398 | 500.0 | 557.0 |
| master | 533.3354 | 532.0 | 44.18649 | 502.0 | 562.0 |
| specialist | 535.6486 | 529.0 | 43.12528 | 518.0 | 562.0 |
| phd | 533.1364 | 524.5 | 35.25053 | 507.0 | 555.0 |
| NA | 548.5000 | 549.0 | 26.78704 | 536.5 | 564.5 |
ggplot(data, aes(x = degree1, y = math1)) +
geom_boxplot(fill = "#AFAAFF", color = "#263238") +
labs(title = "Math Scores by Teacher's Degree Level",
x = "Teacher's Degree Level",
y = "Math Score") +
theme_minimal()

The distributions of math scores across teacher degree levels are
similar, with group means ranging from 529.0 (bachelor) to 535.6
(specialist) among the non-missing levels. Further analysis is needed to
determine whether there are any significant differences in student
performance across teacher degree levels.
Math Scores (math1) vs. Teacher’s Teaching Experience (experience1)
# Scatterplot of math scores against teaching experience
ggplot(data, aes(x = experience1, y = math1)) +
geom_jitter(color = "#AFAAFF", alpha = 0.6, size = 1, width = 0.1, height = 0.1) + # Reduce overlap
labs(title = "Scatterplot of Math Scores vs. Teaching Experience",
x = "Teaching Experience",
y = "Math Score") +
theme_minimal()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

There is no clear pattern or visual association between teacher’s
teaching experience and student math scores.
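A rank correlation can put a number on the absence of a visible trend. The sketch below computes Spearman's correlation with `cor.test()` on simulated data in which experience and score are generated independently; both vectors are hypothetical stand-ins for experience1 and math1.

```r
set.seed(4)

# Simulated data: experience and score drawn independently (hypothetical values)
experience <- sample(0:40, 200, replace = TRUE)
score <- rnorm(200, mean = 530, sd = 43)

# Spearman's rank correlation; exact = FALSE because tied ranks rule out an exact p-value
ct <- cor.test(experience, score, method = "spearman", exact = FALSE)
print(ct)
```

Spearman's coefficient only captures monotonic association, so a value near zero would not rule out a non-monotonic relationship.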
Math Scores (math1) vs. Teacher’s Career Ladder Level (ladder1)
ladder_count <- data %>%
group_by(ladder1) %>%
summarise(mean_math1 = mean(math1, na.rm = TRUE),
median_math1 = median(math1, na.rm = TRUE),
sd_math1 = sd(math1, na.rm = TRUE),
q1_math1 = quantile(math1, 0.25, na.rm = TRUE),
q3_math1 = quantile(math1, 0.75, na.rm = TRUE))
ladder_count %>%
kable(caption = "Table 11: Summary Statistics of Math Scores by Career Ladder Level") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
Table 11: Summary Statistics of Math Scores by Career Ladder Level
| ladder1 | mean_math1 | median_math1 | sd_math1 | q1_math1 | q3_math1 |
|---|---|---|---|---|---|
| level1 | 532.1803 | 529 | 43.17080 | 502.0 | 562 |
| level2 | 529.8019 | 523 | 48.59501 | 495.5 | 557 |
| level3 | 541.8566 | 538 | 45.44109 | 515.0 | 572 |
| apprentice | 528.4583 | 526 | 43.03445 | 500.0 | 553 |
| probation | 521.3891 | 520 | 39.11454 | 493.0 | 549 |
| notladder | 526.5183 | 529 | 41.82333 | 495.0 | 557 |
| NA | 501.0909 | 488 | 43.85513 | 468.0 | 538 |
ggplot(data, aes(x = ladder1, y = math1)) +
geom_boxplot(fill = "#AFAAFF", color = "#263238") +
labs(title = "Math Scores by Teacher's Career Ladder Level",
x = "Teacher's Career Ladder Level",
y = "Math Score") +
theme_minimal()

While students of level 3 teachers have a slightly higher mean math
score (541.9) than those at other career ladder levels, the
distributions of math scores across career ladder levels are otherwise
relatively similar. Further analysis will be required to determine
whether there are any significant differences in student performance
across teacher career ladder levels.